Challenge 8

challenge_8

faostat

Joining Data

Author

Mekhala Kumar

Published

August 25, 2022

library(tidyverse)
library(ggplot2)
library(readr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Description of the data

Two datasets have been used for this dataset- the FAOSTAT cattle and the FAOSTAT chicken ones. They are time-series datasets with information about the amount of eggs and chicken yielded or produced. The information is present for several countries.

dairy <- read_csv("_data/FAOSTAT_cattle_dairy.csv")
egg <- read_csv("_data/FAOSTAT_egg_chicken.csv")
dim(dairy)

[1] 36449    14

dim(egg)

[1] 38170    14

print(summarytools::dfSummary(dairy,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

dairy

Dimensions: 36449 x 14
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

Domain Code [character]

1. QL

36449

(

100.0%

)

0 (0.0%)

Domain [character]

1. Livestock Primary

36449

(

100.0%

)

0 (0.0%)

Area Code [numeric]

Mean (sd) : 775.2 (1713.8)

min ≤ med ≤ max:

1 ≤ 141 ≤ 5504

IQR (CV) : 146 (2.2)

232 distinct values

0 (0.0%)

Area [character]

1. Afghanistan

2. Africa

3. Albania

4. Algeria

5. American Samoa

6. Americas

7. Angola

8. Antigua and Barbuda

9. Argentina

10. Asia

[ 222 others ]

174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
34709	(	95.2%	)

0 (0.0%)

Element Code [numeric]

Mean (sd) : 5416 (78.5)

min ≤ med ≤ max:

5318 ≤ 5420 ≤ 5510

IQR (CV) : 192 (0)

5318	:	12158	(	33.4%	)
5420	:	12121	(	33.3%	)
5510	:	12170	(	33.4%	)

0 (0.0%)

Element [character]

1. Milk Animals

2. Production

3. Yield

12158	(	33.4%	)
12170	(	33.4%	)
12121	(	33.3%	)

0 (0.0%)

Item Code [numeric]

1 distinct value

882

36449

(

100.0%

)

0 (0.0%)

Item [character]

1. Milk, whole fresh cow

36449

(

100.0%

)

0 (0.0%)

Year Code [numeric]

Mean (sd) : 1990.4 (16.8)

min ≤ med ≤ max:

1961 ≤ 1991 ≤ 2018

IQR (CV) : 29 (0)

58 distinct values

0 (0.0%)

Year [numeric]

Mean (sd) : 1990.4 (16.8)

min ≤ med ≤ max:

1961 ≤ 1991 ≤ 2018

IQR (CV) : 29 (0)

58 distinct values

0 (0.0%)

Unit [character]

1. Head

2. hg/An

3. tonnes

12158	(	33.4%	)
12121	(	33.3%	)
12170	(	33.4%	)

0 (0.0%)

Value [numeric]

Mean (sd) : 4410235 (25744621)

min ≤ med ≤ max:

7 ≤ 43266 ≤ 683217055

IQR (CV) : 692151 (5.8)

24088 distinct values

74 (0.2%)

Flag [character]

1. *

2. A

3. F

4. Fc

5. Im

6. M

810	(	3.1%	)
3070	(	11.6%	)
7045	(	26.7%	)
13136	(	49.7%	)
2270	(	8.6%	)
74	(	0.3%	)

10044 (27.6%)

Flag Description [character]

1. Aggregate, may include of

2. Calculated data

3. Data not available

4. FAO data based on imputat

5. FAO estimate

6. Official data

7. Unofficial figure

3070	(	8.4%	)
13136	(	36.0%	)
74	(	0.2%	)
2270	(	6.2%	)
7045	(	19.3%	)
10044	(	27.6%	)
810	(	2.2%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-25

print(summarytools::dfSummary(egg,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

egg

Dimensions: 38170 x 14
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

Domain Code [character]

1. QL

38170

(

100.0%

)

0 (0.0%)

Domain [character]

1. Livestock Primary

38170

(

100.0%

)

0 (0.0%)

Area Code [numeric]

Mean (sd) : 771.1 (1709.6)

min ≤ med ≤ max:

1 ≤ 143 ≤ 5504

IQR (CV) : 145 (2.2)

245 distinct values

0 (0.0%)

Area [character]

1. Afghanistan

2. Africa

3. Albania

4. Algeria

5. American Samoa

6. Americas

7. Angola

8. Antigua and Barbuda

9. Argentina

10. Asia

[ 235 others ]

174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
174	(	0.5%	)
36430	(	95.4%	)

0 (0.0%)

Element Code [numeric]

Mean (sd) : 5411.4 (80.5)

min ≤ med ≤ max:

5313 ≤ 5410 ≤ 5510

IQR (CV) : 197 (0)

5313	:	12679	(	33.2%	)
5410	:	12651	(	33.1%	)
5510	:	12840	(	33.6%	)

0 (0.0%)

Element [character]

1. Laying

2. Production

3. Yield

12679	(	33.2%	)
12840	(	33.6%	)
12651	(	33.1%	)

0 (0.0%)

Item Code [numeric]

1 distinct value

1062

38170

(

100.0%

)

0 (0.0%)

Item [character]

1. Eggs, hen, in shell

38170

(

100.0%

)

0 (0.0%)

Year Code [numeric]

Mean (sd) : 1990.5 (16.7)

min ≤ med ≤ max:

1961 ≤ 1991 ≤ 2018

IQR (CV) : 29 (0)

58 distinct values

0 (0.0%)

Year [numeric]

Mean (sd) : 1990.5 (16.7)

min ≤ med ≤ max:

1961 ≤ 1991 ≤ 2018

IQR (CV) : 29 (0)

58 distinct values

0 (0.0%)

Unit [character]

1. 1000 Head

2. 100mg/An

3. tonnes

12679	(	33.2%	)
12651	(	33.1%	)
12840	(	33.6%	)

0 (0.0%)

Value [numeric]

Mean (sd) : 291341.2 (2232761)

min ≤ med ≤ max:

1 ≤ 31996 ≤ 76769955

IQR (CV) : 91235.8 (7.7)

21325 distinct values

40 (0.1%)

Flag [character]

1. *

2. A

3. F

4. Fc

5. Im

6. M

1435	(	4.7%	)
3186	(	10.4%	)
10538	(	34.4%	)
13344	(	43.6%	)
2079	(	6.8%	)
40	(	0.1%	)

7548 (19.8%)

Flag Description [character]

1. Aggregate, may include of

2. Calculated data

3. Data not available

4. FAO data based on imputat

5. FAO estimate

6. Official data

7. Unofficial figure

3186	(	8.3%	)
13344	(	35.0%	)
40	(	0.1%	)
2079	(	5.4%	)
10538	(	27.6%	)
7548	(	19.8%	)
1435	(	3.8%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-25

Tidy Data (as needed)

The data is already tidy in both the datasets. In the item column, there was unnecessary text so I removed that.

dairy<-dairy%>%mutate(Item=str_remove(Item,", whole fresh cow"))
egg<-egg%>%mutate(Item=str_remove(Item,", hen, in shell"))

Join Data

At first, I joined the datasets by doing an inner join and using the common column as year. However, both the datasets, have the same column names and information and using inner join created more columns, essentially a duplicate, with the same information.
I realised that I wanted to add the rows from the FAOSTAT chicken dataset to the existing FAOSTAT cattle dataset. Therefore, I used rbind() to join the two datasets.

#full_join()
dairy_egg<-dairy %>% 
  inner_join(egg, by = "Year")

head(dairy_egg)

# A tibble: 6 × 27
  Domain C…¹ Domai…² Area …³ Area.x Eleme…⁴ Eleme…⁵ Item …⁶ Item.x Year …⁷  Year
  <chr>      <chr>     <dbl> <chr>    <dbl> <chr>     <dbl> <chr>    <dbl> <dbl>
1 QL         Livest…       2 Afgha…    5318 Milk A…     882 Milk      1961  1961
2 QL         Livest…       2 Afgha…    5318 Milk A…     882 Milk      1961  1961
3 QL         Livest…       2 Afgha…    5318 Milk A…     882 Milk      1961  1961
4 QL         Livest…       2 Afgha…    5318 Milk A…     882 Milk      1961  1961
5 QL         Livest…       2 Afgha…    5318 Milk A…     882 Milk      1961  1961
6 QL         Livest…       2 Afgha…    5318 Milk A…     882 Milk      1961  1961
# … with 17 more variables: Unit.x <chr>, Value.x <dbl>, Flag.x <chr>,
#   `Flag Description.x` <chr>, `Domain Code.y` <chr>, Domain.y <chr>,
#   `Area Code.y` <dbl>, Area.y <chr>, `Element Code.y` <dbl>, Element.y <chr>,
#   `Item Code.y` <dbl>, Item.y <chr>, `Year Code.y` <dbl>, Unit.y <chr>,
#   Value.y <dbl>, Flag.y <chr>, `Flag Description.y` <chr>, and abbreviated
#   variable names ¹`Domain Code.x`, ²Domain.x, ³`Area Code.x`,
#   ⁴`Element Code.x`, ⁵Element.x, ⁶`Item Code.x`, ⁷`Year Code.x`
# ℹ Use `colnames()` to see all variable names

tail(dairy_egg)

# A tibble: 6 × 27
  Domain C…¹ Domai…² Area …³ Area.x Eleme…⁴ Eleme…⁵ Item …⁶ Item.x Year …⁷  Year
  <chr>      <chr>     <dbl> <chr>    <dbl> <chr>     <dbl> <chr>    <dbl> <dbl>
1 QL         Livest…    5504 Polyn…    5510 Produc…     882 Milk      2018  2018
2 QL         Livest…    5504 Polyn…    5510 Produc…     882 Milk      2018  2018
3 QL         Livest…    5504 Polyn…    5510 Produc…     882 Milk      2018  2018
4 QL         Livest…    5504 Polyn…    5510 Produc…     882 Milk      2018  2018
5 QL         Livest…    5504 Polyn…    5510 Produc…     882 Milk      2018  2018
6 QL         Livest…    5504 Polyn…    5510 Produc…     882 Milk      2018  2018
# … with 17 more variables: Unit.x <chr>, Value.x <dbl>, Flag.x <chr>,
#   `Flag Description.x` <chr>, `Domain Code.y` <chr>, Domain.y <chr>,
#   `Area Code.y` <dbl>, Area.y <chr>, `Element Code.y` <dbl>, Element.y <chr>,
#   `Item Code.y` <dbl>, Item.y <chr>, `Year Code.y` <dbl>, Unit.y <chr>,
#   Value.y <dbl>, Flag.y <chr>, `Flag Description.y` <chr>, and abbreviated
#   variable names ¹`Domain Code.x`, ²Domain.x, ³`Area Code.x`,
#   ⁴`Element Code.x`, ⁵Element.x, ⁶`Item Code.x`, ⁷`Year Code.x`
# ℹ Use `colnames()` to see all variable names

dim(dairy_egg)

[1] 24076265       27

new_dairy_egg<-rbind(dairy,egg)
head(new_dairy_egg)

# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk     1961  1961 Head 
2 QL      Lives…       2 Afgh…    5420 Yield       882 Milk     1961  1961 hg/An
3 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk     1961  1961 tonn…
4 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk     1962  1962 Head 
5 QL      Lives…       2 Afgh…    5420 Yield       882 Milk     1962  1962 hg/An
6 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk     1962  1962 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
# ℹ Use `colnames()` to see all variable names

tail(new_dairy_egg)

# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…    5504 Poly…    5313 Laying     1062 Eggs     2017  2017 1000…
2 QL      Lives…    5504 Poly…    5410 Yield      1062 Eggs     2017  2017 100m…
3 QL      Lives…    5504 Poly…    5510 Produc…    1062 Eggs     2017  2017 tonn…
4 QL      Lives…    5504 Poly…    5313 Laying     1062 Eggs     2018  2018 1000…
5 QL      Lives…    5504 Poly…    5410 Yield      1062 Eggs     2018  2018 100m…
6 QL      Lives…    5504 Poly…    5510 Produc…    1062 Eggs     2018  2018 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
# ℹ Use `colnames()` to see all variable names

dim(new_dairy_egg)

[1] 74619    14

print(summarytools::dfSummary(new_dairy_egg,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

new_dairy_egg

Dimensions: 74619 x 14
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

Domain Code [character]

1. QL

74619

(

100.0%

)

0 (0.0%)

Domain [character]

1. Livestock Primary

74619

(

100.0%

)

0 (0.0%)

Area Code [numeric]

Mean (sd) : 773.1 (1711.6)

min ≤ med ≤ max:

1 ≤ 142 ≤ 5504

IQR (CV) : 146 (2.2)

248 distinct values

0 (0.0%)

Area [character]

1. Afghanistan

2. Africa

3. Albania

4. Algeria

5. American Samoa

6. Americas

7. Angola

8. Antigua and Barbuda

9. Argentina

10. Asia

[ 238 others ]

348	(	0.5%	)
348	(	0.5%	)
348	(	0.5%	)
348	(	0.5%	)
348	(	0.5%	)
348	(	0.5%	)
348	(	0.5%	)
348	(	0.5%	)
348	(	0.5%	)
348	(	0.5%	)
71139	(	95.3%	)

0 (0.0%)

Element Code [numeric]

Mean (sd) : 5413.7 (79.6)

min ≤ med ≤ max:

5313 ≤ 5410 ≤ 5510

IQR (CV) : 192 (0)

5313	:	12679	(	17.0%	)
5318	:	12158	(	16.3%	)
5410	:	12651	(	17.0%	)
5420	:	12121	(	16.2%	)
5510	:	25010	(	33.5%	)

0 (0.0%)

Element [character]

1. Laying

2. Milk Animals

3. Production

4. Yield

12679	(	17.0%	)
12158	(	16.3%	)
25010	(	33.5%	)
24772	(	33.2%	)

0 (0.0%)

Item Code [numeric]

Min : 882

Mean : 974.1

Max : 1062

882	:	36449	(	48.8%	)
1062	:	38170	(	51.2%	)

0 (0.0%)

Item [character]

1. Eggs

2. Milk

38170	(	51.2%	)
36449	(	48.8%	)

0 (0.0%)

Year Code [numeric]

Mean (sd) : 1990.4 (16.8)

min ≤ med ≤ max:

1961 ≤ 1991 ≤ 2018

IQR (CV) : 29 (0)

58 distinct values

0 (0.0%)

Year [numeric]

Mean (sd) : 1990.4 (16.8)

min ≤ med ≤ max:

1961 ≤ 1991 ≤ 2018

IQR (CV) : 29 (0)

58 distinct values

0 (0.0%)

Unit [character]

1. 1000 Head

2. 100mg/An

3. Head

4. hg/An

5. tonnes

12679	(	17.0%	)
12651	(	17.0%	)
12158	(	16.3%	)
12121	(	16.2%	)
25010	(	33.5%	)

0 (0.0%)

Value [numeric]

Mean (sd) : 2302277 (18176150)

min ≤ med ≤ max:

1 ≤ 36400 ≤ 683217055

IQR (CV) : 157045 (7.9)

41824 distinct values

114 (0.2%)

Flag [character]

1. *

2. A

3. F

4. Fc

5. Im

6. M

2245	(	3.9%	)
6256	(	11.0%	)
17583	(	30.8%	)
26480	(	46.4%	)
4349	(	7.6%	)
114	(	0.2%	)

17592 (23.6%)

Flag Description [character]

1. Aggregate, may include of

2. Calculated data

3. Data not available

4. FAO data based on imputat

5. FAO estimate

6. Official data

7. Unofficial figure

6256	(	8.4%	)
26480	(	35.5%	)
114	(	0.2%	)
4349	(	5.8%	)
17583	(	23.6%	)
17592	(	23.6%	)
2245	(	3.0%	)

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-25

Analysis

Before analysis, I filtered the dataset to only have the production values since these were measured by the same quantity(tonnes) for milk and eggs. Hence they could be compared. Moreover, I filtered the data to have values from India.
In the scatterplot below, it is evident that the milk production has increased at a rapid rate from 1961 to 2018. This could correspond to the increase in population and popular consumption of milk. However, the eggs production has not had such a rapid increase in the same time period. This may be attributed to the fact that several Indians follow a vegetarian diet which does not include the consumption of eggs.

new_dairy_egg_sub<-new_dairy_egg%>%filter(Element=="Production")%>%filter(Area=="India")
#year,item,value

ggplot(new_dairy_egg_sub, aes(x=Year,y=Value, color=Item)) +
     geom_point() +
     labs(title = "Production of dairy and eggs across years (in tonnes) in India")